Constructing specialised corpora through analysing domain representativeness of websites
Abstract
The role of the Web in text corpus construction is becoming increasingly significant. However, the contribution of the Web is largely confined to building a general virtual corpus or low-quality specialised corpora. In this paper, we introduce a new technique called SPARTAN for constructing specialised corpora from the Web by systematically analysing website contents. Our evaluations show that the corpora constructed using our technique are independent of the search engines employed. In particular, SPARTAN-derived corpora outperform all corpora based on existing techniques for the task of term recognition.
Similar resources
A survey of the websites of the general directorates of Iran's public libraries: a webometric study
Purpose: This study aims to evaluate the status of links in the provincial websites of the Iran Public Libraries Foundation by analysing different types of web links. Methodology: Link analysis, a webometric method, was used in the present research. Data were collected using the LexiURL software and the Yahoo search engine. The population under study included the Provincial websit...
Comparing two acquisition systems for automatically building an English-Croatian parallel corpus from multilingual websites
In this paper we compare two tools for automatically harvesting bitexts from multilingual websites: bitextor and ILSP-FC. We used both tools for crawling 21 multilingual websites from the tourism domain to build a domain-specific English–Croatian parallel corpus. Different settings were tried for both tools and 10,662 unique document pairs were obtained. A sample of about 10% of them was manual...
Survey on the Status of Persian-Language Health Services through the Internet
Background: The Internet has transformed the way people seek information and has changed users' approach to information, particularly in the health domain. In this regard, the number of Persian-language websites offering health services is increasing. Therefore, information about the variety of services offered by them is very important. The present study was designed to describe ...
A THESIS PROPOSAL about USING ZIPF FREQUENCIES AS A REPRESENTATIVENESS MEASURE IN STATISTICAL ACTIVE LEARNING OF NATURAL LANGUAGE
Active learning has proven to be a successful strategy for the quick development of corpora to be used in statistical induction of natural language. The vast majority of studies in this field have concentrated on finding and testing various informativeness measures for samples; however, representativeness measures for selected samples have not been thoroughly studied. In this thesis, we intend to intro...
Verbs in specialised corpora: from manual corpus-based description to automatic extraction in an English-French parallel corpus
This paper tackles the issue of verbs in specialised corpora with a view to term extraction. Corpus-based manual descriptions to be used in various applications have highlighted the “deviant” uses of verbs in specialised corpora compared with general uses, as well as the need for verb extraction. However, very little attention has been given to verbs both in the terminology theory and automatic ter...
Journal title:
- Language Resources and Evaluation
Volume 45, Issue
Pages -
Publication date 2011